Align Me: A framework to generate Parallel Corpus Using OCRs and Bilingual Dictionaries
نویسندگان
چکیده
Multilingual processing tasks like statistical machine translation and cross language information retrieval rely mainly on availability of accurate parallel corpora. Manual construction of such corpus can be extremely expensive and time consuming. In this paper we present a simple yet efficient method to generate huge amount of reasonably accurate parallel corpus with minimal user efforts. We utilize the availability of large number of English books and their corresponding translations in other languages to build parallel corpus. Optical Character Recognition systems are used to digitize such books. We propose a robust dictionary based parallel corpus generation system for alignment of multilingual text at different levels of granularity (sentence, paragraphs, etc). We show the performance of our proposed method on a manually aligned dataset of 300 Hindi-English sentences and 100 English-Malayalam sentences.
منابع مشابه
Unsupervised Word Mapping Using Structural Similarities in Monolingual Embeddings
Most existing methods for automatic bilingual dictionary induction rely on prior alignments between the source and target languages, such as parallel corpora or seed dictionaries. For many language pairs, such supervised alignments are not readily available. We propose an unsupervised approach for learning a bilingual dictionary for a pair of languages given their independently-learned monoling...
متن کاملGenerating Cross-lingual Concept Space from Parallel Corpora on the Web
The information available in languages other than English on the World Wide Web is increasing significantly. To cross language boundaries between different languages, dictionaries are the most typical tools. However, the general-purpose dictionary is less sensitive in genre and domain and it is impractical to manually construct tailored bilingual dictionaries or sophisticated multilingual thesa...
متن کاملBilingual phrase-to-phrase alignment for arbitrarily-small datasets
This paper presents a novel system for sub-sentential alignment of bilingual sentence pairs, however few, using readily-available machine-readable bilingual dictionaries. Performance is evaluated against an existing gold-standard parallel corpus where word alignments are annotated, showing results that are a considerable improvement on a comparable system and on GIZA++ performance for the same ...
متن کاملFlexible Statistical Construction of Bilingual Dictionaries
Most previous systems for constructing a bilingual dictionary from a parallel corpus have depended on an iterative algorithm, using word translation probabilities to align words in the corpus, and using word alignments to estimate word translation probabilities, and repeating until convergence. While this approach produces reasonable results, it is computationally slow, limiting the size of the...
متن کاملCross-Lingual Bootstrapping of Semantic Lexicons: The Case of FrameNet
This paper considers the problem of unsupervised semantic lexicon acquisition. We introduce a fully automatic approach which exploits parallel corpora, relies on shallow text properties, and is relatively inexpensive. Given the English FrameNet lexicon, our method exploits word alignments to generate frame candidate lists for new languages, which are subsequently pruned automatically using a sm...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016